Contents

  • Reviews
  • The Log-Likelihood Gradient
  • Stochastic Maximum Likelihood and Contrastive Divergence
  • Pseudolikelihood
  • Score Matching and Ratio Matching
  • Denoising Score Matching
  • Noise-Contrastive Estimation
  • Estimating the Partition Function
    • Annealed Importance Sampling
    • Bridge Sampling

Reviews

  • Probability vs Compatibility
    • $p(x) = \cfrac{1}{Z}\tilde{p}(x)$
    • where $Z = \int \tilde{p}(x) \, dx$ (a sum over $x$ in the discrete case)
    • $Z$ is the partition function (computed by brute force in the sketch after this list)
  • Energy-based model

    • $\tilde{p}(\mathbf{x}) = \exp(-E(\mathbf{x}))$
      • ensures $\tilde{p}(\mathbf{x}) > 0$ for all $\mathbf{x}$
      • $E(\mathbf{x})$ is the energy function
      • $\tilde{p}(\mathbf{x})$ in EBM is an example of a Boltzmann distribution.
  • MCMC

    • We can sample from the system's steady state (its stationary distribution) even from a random initialization
    • Used to approximate intractable integrals (expectations over the model)
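A minimal sketch to make the review concrete, assuming a made-up energy function over a tiny discrete state space so that $Z$ can be computed by brute force:

```python
import numpy as np

# Tiny energy-based model: E(x) is an arbitrary illustrative choice.
states = np.array([0.0, 1.0, 2.0, 3.0])   # the entire (discrete) state space

def energy(x):
    return (x - 1.5) ** 2                 # any real-valued E(x) works

p_tilde = np.exp(-energy(states))         # compatibility: always > 0
Z = p_tilde.sum()                         # partition function (a sum, since x is discrete)
p = p_tilde / Z                           # normalized Boltzmann distribution

print(p, p.sum())                         # p.sum() == 1.0
```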

18.1 Log-likelihood Gradient

  • Gradient decomposes into two terms: $\nabla_\theta \log p(x) = \nabla_\theta \log \tilde{p}(x) - \nabla_\theta \log Z$
  • Partition $Z$ is what makes learning hard: $\nabla_\theta \log Z = \mathbb{E}_{x\sim p(x)}[\nabla_\theta \log \tilde{p}(x)]$, an expectation under the model itself (see the numerical check below)
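A quick numerical check of the identity above, using a made-up exponential-family model $\tilde{p}(x) = \exp(\theta x)$ over four states: the finite-difference gradient of $\log Z$ matches the model expectation of $\nabla_\theta \log \tilde{p}(x) = x$.

```python
import numpy as np

states = np.arange(4.0)                   # x in {0, 1, 2, 3}

def log_Z(theta):
    return np.log(np.exp(theta * states).sum())

theta, eps = 0.7, 1e-5
grad_fd = (log_Z(theta + eps) - log_Z(theta - eps)) / (2 * eps)  # finite differences

p = np.exp(theta * states)
p /= p.sum()
grad_expect = (p * states).sum()          # E_{x~p}[grad_theta log p_tilde(x)] = E_p[x]

print(grad_fd, grad_expect)               # agree to ~1e-9
```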

18.2 Stochastic Maximum Likelihood and Contrastive Divergence

  • How to use gibbs_update (MCMC that resamples a subset of the variables conditioned on the rest)
  • How to make it Faster
  • Positive Phase vs Negative Phase
    • $\nabla_\theta \log p(x)=\nabla_\theta \log \tilde{p}(x) - \mathbb{E}_{x\sim p(x)} [\nabla_\theta \log \tilde{p}(x)]$
      • Positive: $\nabla_\theta \log \tilde{p}(x)$
      • Negative: $\mathbb{E}_{x\sim p(x)} [\nabla_\theta \log \tilde{p}(x)]$
    • Practical Calculation
      • Gradient update $= \sum_{x\sim \text{data}} \nabla_\theta \log \tilde{p}(x) - \sum_{x\sim \text{MCMC}} \nabla_\theta \log \tilde{p}(x)$

Naive Algorithm
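A runnable sketch of the naive approach, under assumptions not in the original notes: a toy model with independent binary units ($E(\mathbf{x}) = -\theta^\top \mathbf{x}$, so $\nabla_\theta \log \tilde{p}(\mathbf{x}) = \mathbf{x}$), synthetic data, and a hand-picked learning rate. Every gradient step restarts the chains at random and burns in with many gibbs_update sweeps:

```python
import numpy as np

rng = np.random.default_rng(0)
D, m, lr, k_burn_in = 5, 64, 0.1, 100
data = (rng.random((1000, D)) < 0.8).astype(float)  # synthetic training set
theta = np.zeros(D)

def gibbs_update(x, theta):
    """One full Gibbs sweep: resample each unit from p(x_i = 1 | x_-i)."""
    for i in range(x.shape[1]):
        p_on = 1.0 / (1.0 + np.exp(-theta[i]))      # sigmoid(theta_i) for this toy model
        x[:, i] = (rng.random(x.shape[0]) < p_on).astype(float)
    return x

for step in range(200):
    batch = data[rng.integers(0, len(data), m)]
    x = (rng.random((m, D)) < 0.5).astype(float)    # fresh random initialization...
    for _ in range(k_burn_in):                      # ...then an expensive burn-in every step
        x = gibbs_update(x, theta)
    theta += lr * (batch.mean(0) - x.mean(0))       # positive phase - negative phase

print(1 / (1 + np.exp(-theta)))                     # learned p(x_i = 1), near 0.8
```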

Contrastive Divergence Algorithm (CD)

  • Motivation
    • Burning in the Gibbs chain from a random initialization (k ≈ 100 gibbs_update steps) at every gradient step is too expensive computationally.
    • CD initializes the chain at the data instead, reducing k to 1-20 for a model of a small image patch (see the sketch below)
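CD-k on the same toy model as the naive sketch above; the only changes are the chain initialization (at the data batch) and the much smaller k:

```python
import numpy as np

rng = np.random.default_rng(0)
D, m, lr, k = 5, 64, 0.1, 5                         # k is small now
data = (rng.random((1000, D)) < 0.8).astype(float)
theta = np.zeros(D)

def gibbs_update(x, theta):
    for i in range(x.shape[1]):
        p_on = 1.0 / (1.0 + np.exp(-theta[i]))
        x[:, i] = (rng.random(x.shape[0]) < p_on).astype(float)
    return x

for step in range(200):
    batch = data[rng.integers(0, len(data), m)]
    x = batch.copy()                  # key CD idea: start the chain at the data
    for _ in range(k):                # only a few cheap sweeps
        x = gibbs_update(x, theta)
    theta += lr * (batch.mean(0) - x.mean(0))

print(1 / (1 + np.exp(-theta)))
```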

Stochastic Maximum Likelihood Algorithm (SML)

  • Also called Persistent Contrastive Divergence (PCD)
  • Reduces k to 1 for a model of a small image patch
  • How

    • Maintain the contrastive (negative-phase) samples persistently across gradient updates!!! (So simple; see the sketch below)
  • There is a further refinement that accelerates SML, called fast PCD (FPCD).
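SML / PCD on the same toy model: the negative-phase chains are created once and never re-initialized, so a single gibbs_update sweep per gradient step suffices while the parameters move slowly:

```python
import numpy as np

rng = np.random.default_rng(0)
D, m, lr = 5, 64, 0.1
data = (rng.random((1000, D)) < 0.8).astype(float)
theta = np.zeros(D)
persistent = (rng.random((m, D)) < 0.5).astype(float)  # chains live across updates

def gibbs_update(x, theta):
    for i in range(x.shape[1]):
        p_on = 1.0 / (1.0 + np.exp(-theta[i]))
        x[:, i] = (rng.random(x.shape[0]) < p_on).astype(float)
    return x

for step in range(200):
    batch = data[rng.integers(0, len(data), m)]
    persistent = gibbs_update(persistent, theta)       # k = 1, no re-initialization
    theta += lr * (batch.mean(0) - persistent.mean(0))

print(1 / (1 + np.exp(-theta)))
```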

18.3 Pseudolikelihood

  • Set $\log p(\vec{x}) \approx \sum_d \log p(x_d|\vec{x}_{-d})$
  • Use the conditional probability property ($Z$ cancels between numerator and denominator)
    • $p(x_1|x_2)=\cfrac{p(x_1,x_2)}{p(x_2)}=\cfrac{\tilde{p}(x_1,x_2)}{\tilde{p}(x_2)}=\cfrac{\tilde{p}(x_1,x_2)}{\int \tilde{p}(x_1,x_2) \, dx_1}$
  • e.g.
    • Suppose we want to calculate $p([1,1,1,1])$ where each $x_d$ takes a state in $\{1, 2, 3\}$ ($K=3$, $D=4$).
      • We have to evaluate $p(x_1=1|x_2=1, x_3=1, x_4=1)$
        • By evaluating the unnormalized model $\tilde{p}$ at every value of $x_1$
          • $\tilde{p}(x_1=1, x_2=1, x_3=1, x_4=1)$
          • $\tilde{p}(x_1=2, x_2=1, x_3=1, x_4=1)$
          • $\tilde{p}(x_1=3, x_2=1, x_3=1, x_4=1)$
        • $p(x_1=1|x_2=1, x_3=1, x_4=1) = \cfrac{\tilde{p}(x_1=1, x_2=1, x_3=1, x_4=1)}{\sum_k \tilde{p}(x_1=k, x_2=1, x_3=1, x_4=1)}$
      • Iterate this for all $D$ dimensions
      • Finally
        • $p([1,1,1,1]) \approx p(x_1=1|[1,1,1]) \times p(x_2=1|[1,1,1]) \times p(x_3=1|[1,1,1]) \times p(x_4=1|[1,1,1])$
    • Cost: $K^D$ evaluations (summing out $Z$ exactly) $\to$ $K \times D$ evaluations of $\tilde{p}$ (see the sketch at the end of this section)
  • This may look like an unprincipled hack, but it can be proven that estimation by maximizing the pseudolikelihood is asymptotically consistent
  • "Generalized" Pseudolikelihood

    • Partition the $D$ dimensions into $m$ sets
      • If $m=1$
        • No partition
        • Exact likelihood
      • If $m=D$
        • One partition per individual dimension
        • Same as Pseudolikelihood
      • If $1<m<D$
        • $\log p(\vec{x}) \approx \sum_{i=1}^{m} \log p(\vec{x}^{(i)}|\vec{x}^{(-i)})$, where $\vec{x}^{(i)}$ is the $i$-th partition
  • It can perform better than maximum likelihood for "tasks that require only the conditional distributions used during training"

    • such as filling in small amounts of missing values.
  • Generalized pseudolikelihood techniques are especially powerful
    • If designed to capture the most important correlations
      • For example, in natural images, widely separated pixels have weak correlation, so the partitions can group nearby, strongly correlated pixels together
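A sketch of the $K \times D$ computation from the example above, assuming an arbitrary made-up $\log \tilde{p}$ (a chain of pairwise couplings) and 0-indexed states:

```python
import numpy as np

rng = np.random.default_rng(0)
K, D = 3, 4                                  # K states per dimension, D dimensions
coupling = rng.normal(size=(K, K))           # made-up pairwise parameters

def log_p_tilde(x):
    """Unnormalized log-probability of a full configuration x (length D)."""
    return sum(coupling[x[d], x[d + 1]] for d in range(D - 1))

def log_pseudolikelihood(x):
    total = 0.0
    for d in range(D):                       # one conditional per dimension ...
        scores = np.empty(K)
        for k in range(K):                   # ... each needs K evaluations of p_tilde
            x_k = x.copy()
            x_k[d] = k
            scores[k] = log_p_tilde(x_k)
        # log p(x_d | x_-d) = log p_tilde(x) - log sum_k p_tilde(x with x_d = k)
        total += scores[x[d]] - np.log(np.exp(scores).sum())
    return total

x = np.array([0, 0, 0, 0])                   # the [1,1,1,1] example, 0-indexed
print(log_pseudolikelihood(x))               # K*D = 12 evaluations, not K**D = 81
```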

18.4 Score Matching and Ratio Matching

Score Matching

  • Like pseudolikelihood, score matching sidesteps the partition function $Z$: the derivative is taken with respect to $x$, not $\theta$, so $\nabla_x \log p(x) = \nabla_x \log \tilde{p}(x)$ and the $Z$ term drops out
  • Score
    • $\nabla_x \log p(x)$
  • Approximation: minimize the expected squared difference between the model score and the data score; integration by parts turns this into a tractable objective involving only the model, $J(\theta) = \mathbb{E}_{x \sim p_{data}}\left[\sum_j \left(\partial_j^2 \log \tilde{p}(x) + \frac{1}{2}\left(\partial_j \log \tilde{p}(x)\right)^2\right)\right]$ (see the sketch below)
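A sketch of that tractable objective for a 1-D Gaussian model family (chosen because its score has a closed form; the data and grid are made up). Minimizing $J$ recovers the data mean and standard deviation without ever touching $Z$:

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=2.0, scale=1.5, size=10_000)

def J(mu, sigma):
    """Tractable score-matching objective for log p_tilde(x) = -(x-mu)^2 / (2 sigma^2)."""
    d1 = -(data - mu) / sigma**2      # d/dx log p_tilde(x): the model score
    d2 = -1.0 / sigma**2              # d^2/dx^2 log p_tilde(x)
    return np.mean(d2 + 0.5 * d1**2)

# Brute-force minimization over a grid of (mu, sigma).
mus, sigmas = np.linspace(0, 4, 81), np.linspace(0.5, 3, 51)
vals = np.array([[J(m_, s_) for s_ in sigmas] for m_ in mus])
i, j = np.unravel_index(vals.argmin(), vals.shape)
print(mus[i], sigmas[j])              # close to the true (2.0, 1.5)
```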

Ratio Matching

  • A more successful approach to extending the basic ideas of score matching to discrete data
  • Ratio? Work with the ratios $\tilde{p}(\mathbf{x}) / \tilde{p}(f(\mathbf{x}, j))$, where $f(\mathbf{x}, j)$ flips bit $j$ of the binary vector $\mathbf{x}$, so the partition function cancels; the objective is $L^{(RM)}(\mathbf{x}) = \sum_j \left(\cfrac{1}{1 + \frac{\tilde{p}(\mathbf{x})}{\tilde{p}(f(\mathbf{x},j))}}\right)^2$ (see the sketch below)
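A sketch of that objective for binary data, assuming an arbitrary made-up unnormalized model; note that $Z$ cancels inside the ratio:

```python
import numpy as np

rng = np.random.default_rng(0)
D = 6
W = rng.normal(size=(D, D))
W = (W + W.T) / 2                     # symmetric couplings (an arbitrary choice)

def log_p_tilde(x):
    return x @ W @ x                  # any unnormalized log-probability of binary x

def ratio_matching_loss(x):
    loss = 0.0
    for j in range(D):                # one term per bit
        x_flip = x.copy()
        x_flip[j] = 1 - x_flip[j]     # f(x, j): flip bit j
        ratio = np.exp(log_p_tilde(x) - log_p_tilde(x_flip))  # Z cancels here
        loss += (1.0 / (1.0 + ratio)) ** 2
    return loss

x = (rng.random(D) < 0.5).astype(float)
print(ratio_matching_loss(x))
```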
